Take-home 1 - Geospatial Analytics for Public Good

Author

Muhamad Ameer Noor

Published

December 2, 2023

Modified

December 2, 2023

1 Overview

As cities upgrade to digital systems, like smart buses and GPS, a lot of data is created about how people move around. This data is super useful for understanding patterns in both space (where) and time (when). For example, on public buses, smart cards and GPS devices collect information about routes and how many people are riding. This data holds clues about how people behave in the city, and understanding these patterns can help manage the city better and give an edge to transportation providers.

But, often, this valuable data is only used for basic things like tracking and mapping using Geographic Information System (GIS) applications. This is because regular GIS tools struggle to fully analyze and model this kind of complex spatial and time-related data.

In this exercise, we’re going to focus on bus trips during specific peak hours:

Peak hour period                Bus tap-on time
Weekday morning peak            6am to 9am
Weekday evening peak            5pm to 8pm
Weekend/holiday morning peak    11am to 2pm
Weekend/holiday evening peak    4pm to 7pm
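The hour boundaries for these windows are treated as inclusive tap-on hours in the categorisation code later in this post. A minimal sketch of that mapping, written in Python purely for illustration (the function name classify_peak is my own; the period labels follow the ones used in the analysis):

```python
def classify_peak(day_type: str, hour: int) -> str:
    """Map a tap-on record to its peak period.

    hour is the start of the tap-on hour (0-23); both ends of each
    window are inclusive, mirroring the case_when used later.
    """
    if day_type == "WEEKDAY":
        if 6 <= hour <= 9:
            return "Weekday morning peak"
        if 17 <= hour <= 20:
            return "Weekday evening peak"
    elif day_type == "WEEKENDS/HOLIDAY":
        if 11 <= hour <= 14:
            return "Weekend/holiday morning peak"
        if 16 <= hour <= 19:
            return "Weekend/holiday evening peak"
    return "non-peak"
```

Anything outside these windows is labelled non-peak and excluded from the analysis.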

To better understand these movement patterns and behaviours, we will use Exploratory Spatial Data Analysis (ESDA). In this project, we apply two of its methods, Local Indicators of Spatial Association (LISA) and Emerging Hot Spot Analysis (EHSA), to reveal the spatial and temporal mobility patterns of people who take public buses in Singapore. These methods will help us dig deep into the data and uncover valuable insights.

Local Indicators of Spatial Association (LISA) statistics serve two primary purposes. Firstly, they act as indicators of local pockets of nonstationarity, or hotspots. Secondly, they are used to assess the influence of individual locations on the magnitude of a global statistic. In simpler terms, LISA helps in identifying specific areas (like hotspots) that differ significantly from their surrounding areas and evaluates how these individual areas contribute to the overall pattern observed in a larger region.
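As a rough illustration of what a LISA statistic computes, the sketch below implements the local Moran's I formula (I_i = z_i · Σ_j w_ij z_j / m2, with row-standardised weights) in plain Python. This is an illustrative toy only, not the method used later in this post (which relies on the sfdep package), and the four-area adjacency matrix is made up:

```python
def local_morans_i(values, weights):
    """Local Moran's I for each area: I_i = z_i * sum_j w_ij * z_j / m2,
    where z are deviations from the mean and w is row-standardised."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n   # variance term scaling the statistic
    result = []
    for i in range(n):
        row_sum = sum(weights[i])
        lag = 0.0
        if row_sum > 0:
            # row-standardise: each neighbour contributes w_ij / sum_j w_ij
            lag = sum(w * zj for w, zj in zip(weights[i], z)) / row_sum
        result.append(z[i] * lag / m2)
    return result

# Toy 4-area chain: two high values next to each other, then two low ones.
vals = [10, 9, 1, 2]
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
ii = local_morans_i(vals, adj)
```

The end areas each sit beside a similar value, so their I_i is positive (they belong to a local high or low cluster); the middle areas have one high and one low neighbour, so their lagged values cancel to zero.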

Emerging Hot Spot Analysis (EHSA) is a type of spatial statistical analysis that examines data over space and time to identify trends and patterns, specifically focusing on the detection of statistical hot and cold spots within a study region. This method is applicable in various fields, including crime patterns, disease outbreaks, and environmental issues. EHSA allows users to set parameters and then determines whether identified hot or cold spots are persistent, increasing, or decreasing in a given area.
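EHSA is commonly implemented as a Getis-Ord Gi* statistic computed per time bin, followed by a Mann-Kendall trend test on each location's Gi* time series. The trend half can be sketched in a few lines of Python; the toy series below are made up for illustration:

```python
def mann_kendall_s(series):
    """Mann-Kendall S statistic: concordant minus discordant pairs.
    S > 0 suggests an increasing trend, S < 0 a decreasing one."""
    s = 0
    n = len(series)
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    return s

# A hot spot whose intensity rises over six time bins gives a positive S;
# one that cools down gives a negative S.
rising = [0.8, 1.1, 1.4, 1.3, 1.9, 2.2]
falling = [2.0, 1.7, 1.8, 1.2, 0.9, 0.5]
```

In a full EHSA the sign and significance of this trend, combined with the per-bin hot/cold classification, yield categories such as persistent, intensifying, or diminishing hot spots.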

The following panel explains which aspatial and geospatial data are used in this project.

Monthly Passenger Volume by Origin Destination Bus Stops:

  • August, September and October 2023 Period

  • downloaded from LTA DataMall via API

  • csv format.

  • Columns/fields in the dataset include YEAR_MONTH, DAY_TYPE, TIME_PER_HOUR, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE, and TOTAL_TRIPS.

Two geospatial datasets in SHP format are used in this project:

  • Bus Stop Location from LTA DataMall. It provides information about all the bus stops currently being serviced by buses, including the bus stop code (identifier) and location coordinates.

  • hexagon, a hexagon layer of 500m cells (measured as the perpendicular distance between the centre of a hexagon and its edges). Each spatial unit is regular in shape and finer than the Master Plan 2019 Planning Sub-zone GIS data set of URA. Hexagons were chosen for several reasons:

  • Uniform Distances Everywhere: Think of hexagons as honeycomb cells. Each cell (hexagon) touches its neighbors at the same distance from its center. It’s like standing in the middle of a room and being the same distance from every wall, making it easier to measure and compare things.

  • Outlier-Free Shape: Hexagons are like well-rounded polygons without any pointy tips. Sharp corners can create odd spots in data, but hexagons smoothly cover space without sticking out anywhere. This helps prevent weird data spikes that don’t fit the pattern.

  • Consistent Spatial Relationships: Imagine a beehive where every hexagon is surrounded by others in the same pattern. This regular pattern is great for analyzing data because you can expect the same relationships everywhere, making the data predictable and easier to work with.

  • Ideal for Non-Perpendicular Features: Real-world features like rivers and roads twist and turn. Squares can be awkward for mapping these, but hexagons, which are more circular, can follow their flow better. This way, a hexagon-based map can mimic the real world more closely than a checkerboard of squares.

Summarized from: Dhuri, and Sekste & Kazakov.
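For intuition about the cell sizes involved, all of a regular hexagon's dimensions follow from its apothem (the centre-to-edge distance mentioned above). The Python sketch below uses an apothem of 250 m, i.e. a hexagon 500 m across flats; whether the 500 m parameter corresponds to the apothem or to the across-flats width depends on how st_make_grid's cellsize is interpreted, so treat the numbers as illustrative:

```python
import math

def hexagon_metrics(apothem):
    """Geometry of a regular hexagon given its apothem (the perpendicular
    distance from the centre to an edge)."""
    side = 2 * apothem / math.sqrt(3)        # edge length
    width = 2 * apothem                      # distance across flats
    vertex = 2 * apothem / math.sqrt(3)      # centre-to-vertex distance (= side)
    area = 2 * math.sqrt(3) * apothem ** 2   # 6 * (1/2 * side * apothem)
    return side, width, vertex, area

side, width, vertex, area = hexagon_metrics(250)
# A hexagon 500 m across flats covers about 0.217 km^2.
```

This keeps each spatial unit well below the size of a planning sub-zone, which is the point of the finer grid.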

2 Preparation

Before starting the analysis, we have to load the libraries (installing any that are not yet in the environment), import the data, and process the data.

2.1 Import Library

Code
pacman::p_load(tmap, sf, tidyverse, sfdep, knitr, Hmisc, mapview, DT)

pacman::p_load(kableExtra, DT, ggplot2, patchwork, ggrain)

Explanations for the imported libraries:

  • tmap for visualising geospatial data

  • sf for handling geospatial data

  • tidyverse for handling aspatial data

  • sfdep for computing spatial weights and global and local spatial autocorrelation statistics

  • knitr for creating HTML tables

  • Hmisc for summary statistics

  • mapview for interactive map backgrounds

  • DT for creating interactive HTML tables

  • kableExtra for formatting dataframes (loaded but not used yet)

  • ggplot2, patchwork and ggrain for visualising attributes

2.2 Import and Setup the Data

Aspatial

The following code imports the origin-destination bus data for each month and checks a sample dataframe. The process also transforms the georeference columns into factors, to ease compatibility issues and allow more efficient processing.

Code
# Load each csv file
odb8 <- read_csv("../data/aspatial/origin_destination_bus_202308.csv.gz")

# change georeference data type into factors
odb8 <- odb8 %>%
  mutate(
    ORIGIN_PT_CODE = as.factor(ORIGIN_PT_CODE),
    DESTINATION_PT_CODE = as.factor(DESTINATION_PT_CODE)
  )

# check the dataframe
describe(odb8)
odb8 

 7  Variables      5709512  Observations
--------------------------------------------------------------------------------
YEAR_MONTH 
       n  missing distinct    value 
 5709512        0        1  2023-08 
                  
Value      2023-08
Frequency  5709512
Proportion       1
--------------------------------------------------------------------------------
DAY_TYPE 
       n  missing distinct 
 5709512        0        2 
                                            
Value               WEEKDAY WEEKENDS/HOLIDAY
Frequency           3279836          2429676
Proportion            0.574            0.426
--------------------------------------------------------------------------------
TIME_PER_HOUR 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
 5709512        0       24    0.997    14.07    5.949        6        7 
     .25      .50      .75      .90      .95 
      10       14       18       21       22 

lowest :  0  1  2  3  4, highest: 19 20 21 22 23
--------------------------------------------------------------------------------
PT_TYPE 
       n  missing distinct    value 
 5709512        0        1      BUS 
                  
Value          BUS
Frequency  5709512
Proportion       1
--------------------------------------------------------------------------------
ORIGIN_PT_CODE 
       n  missing distinct 
 5709512        0     5067 

lowest : 01012 01013 01019 01029 01039, highest: 99139 99161 99171 99181 99189
--------------------------------------------------------------------------------
DESTINATION_PT_CODE 
       n  missing distinct 
 5709512        0     5071 

lowest : 01012 01013 01019 01029 01039, highest: 99139 99161 99171 99181 99189
--------------------------------------------------------------------------------
TOTAL_TRIPS 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
 5709512        0     3544    0.982    21.04    33.59        1        1 
     .25      .50      .75      .90      .95 
       2        4       13       38       74 

lowest :     1     2     3     4     5, highest: 30799 31669 32508 33424 35049
--------------------------------------------------------------------------------
Code
# Load each csv file
odb9 <- read_csv("../data/aspatial/origin_destination_bus_202309.csv.gz")

# change georeference data type into factors
odb9 <- odb9 %>%
  mutate(
    ORIGIN_PT_CODE = as.factor(ORIGIN_PT_CODE),
    DESTINATION_PT_CODE = as.factor(DESTINATION_PT_CODE)
  )


# check the dataframe
describe(odb9)
odb9 

 7  Variables      5714196  Observations
--------------------------------------------------------------------------------
YEAR_MONTH 
       n  missing distinct    value 
 5714196        0        1  2023-09 
                  
Value      2023-09
Frequency  5714196
Proportion       1
--------------------------------------------------------------------------------
DAY_TYPE 
       n  missing distinct 
 5714196        0        2 
                                            
Value               WEEKDAY WEEKENDS/HOLIDAY
Frequency           3183300          2530896
Proportion            0.557            0.443
--------------------------------------------------------------------------------
TIME_PER_HOUR 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
 5714196        0       23    0.997    14.06     5.94        6        7 
     .25      .50      .75      .90      .95 
      10       14       18       21       22 

lowest :  0  1  2  4  5, highest: 19 20 21 22 23
--------------------------------------------------------------------------------
PT_TYPE 
       n  missing distinct    value 
 5714196        0        1      BUS 
                  
Value          BUS
Frequency  5714196
Proportion       1
--------------------------------------------------------------------------------
ORIGIN_PT_CODE 
       n  missing distinct 
 5714196        0     5067 

lowest : 01012 01013 01019 01029 01039, highest: 99139 99161 99171 99181 99189
--------------------------------------------------------------------------------
DESTINATION_PT_CODE 
       n  missing distinct 
 5714196        0     5072 

lowest : 01012 01013 01019 01029 01039, highest: 99139 99161 99171 99181 99189
--------------------------------------------------------------------------------
TOTAL_TRIPS 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
 5714196        0     3274    0.981    19.59    31.01        1        1 
     .25      .50      .75      .90      .95 
       2        4       12       35       69 

lowest :     1     2     3     4     5, highest: 27356 28248 28510 30096 31674
--------------------------------------------------------------------------------
Code
# Load each csv file
odb10 <- read_csv("../data/aspatial/origin_destination_bus_202310.csv.gz")

# change georeference data type into factors
odb10 <- odb10 %>%
  mutate(
    ORIGIN_PT_CODE = as.factor(ORIGIN_PT_CODE),
    DESTINATION_PT_CODE = as.factor(DESTINATION_PT_CODE)
  )

# check the dataframe
describe(odb10)
odb10 

 7  Variables      5694297  Observations
--------------------------------------------------------------------------------
YEAR_MONTH 
       n  missing distinct    value 
 5694297        0        1  2023-10 
                  
Value      2023-10
Frequency  5694297
Proportion       1
--------------------------------------------------------------------------------
DAY_TYPE 
       n  missing distinct 
 5694297        0        2 
                                            
Value               WEEKDAY WEEKENDS/HOLIDAY
Frequency           3259419          2434878
Proportion            0.572            0.428
--------------------------------------------------------------------------------
TIME_PER_HOUR 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
 5694297        0       23    0.997    14.04    5.933        6        7 
     .25      .50      .75      .90      .95 
      10       14       18       21       22 

lowest :  0  1  2  4  5, highest: 19 20 21 22 23
--------------------------------------------------------------------------------
PT_TYPE 
       n  missing distinct    value 
 5694297        0        1      BUS 
                  
Value          BUS
Frequency  5694297
Proportion       1
--------------------------------------------------------------------------------
ORIGIN_PT_CODE 
       n  missing distinct 
 5694297        0     5073 

lowest : 01012 01013 01019 01029 01039, highest: 99139 99161 99171 99181 99189
--------------------------------------------------------------------------------
DESTINATION_PT_CODE 
       n  missing distinct 
 5694297        0     5077 

lowest : 01012 01013 01019 01029 01039, highest: 99139 99161 99171 99181 99189
--------------------------------------------------------------------------------
TOTAL_TRIPS 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
 5694297        0     3495    0.982    20.76    33.13        1        1 
     .25      .50      .75      .90      .95 
       2        4       12       37       73 

lowest :     1     2     3     4     5, highest: 30985 31349 32355 35931 36668
--------------------------------------------------------------------------------

Explanations for each field in the data can be found in the following metadata.

  • YEAR_MONTH: Year and month in which the data was collected. Since each dataframe covers one month, only one unique value exists per dataframe.

  • DAY_TYPE: Type of day, classified as weekday or weekend/holiday.

  • TIME_PER_HOUR: Hour on which the passenger trip is based, in intervals from 0 to 23.

  • PT_TYPE: Type of public transport. Since these are bus datasets, only one unique value (BUS) exists per dataframe.

  • ORIGIN_PT_CODE: ID of the origin bus stop.

  • DESTINATION_PT_CODE: ID of the destination bus stop.

  • TOTAL_TRIPS: Number of trips.

Geospatial

The following panel shows the import process for the bus stop geospatial data, a layer of point locations of bus stops across Singapore.

As before, categorical columns that can serve as references are converted into factors while reading the data, to ease compatibility issues and allow more efficient processing.

Code
busstop <- st_read(
    dsn = "../data/geospatial",
    layer = "BusStop"
  ) %>%
  mutate(
    BUS_STOP_N = as.factor(BUS_STOP_N),
    BUS_ROOF_N = as.factor(BUS_ROOF_N),
    LOC_DESC = as.factor(LOC_DESC)
  )
Reading layer `BusStop' from data source `C:\ameernoor\ISSS624\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Code
# check the output
glimpse(busstop)
Rows: 5,161
Columns: 4
$ BUS_STOP_N <fct> 22069, 32071, 44331, 96081, 11561, 66191, 23389, 54411, 285…
$ BUS_ROOF_N <fct> B06, B23, B01, B05, B05, B03, B02A, B02, B09, B01, B16, B02…
$ LOC_DESC   <fct> OPP CEVA LOGISTICS, AFT TRACK 13, BLK 239, GRACE INDEPENDEN…
$ geometry   <POINT [m]> POINT (13576.31 32883.65), POINT (13228.59 44206.38),…
Check the unique values

The counts of unique/distinct values for each column are:

Code
n_distinct(busstop$BUS_STOP_N)
[1] 5145


Code
n_distinct(busstop$BUS_ROOF_N)
[1] 97


Code
n_distinct(busstop$LOC_DESC)
[1] 4559


Code
n_distinct(busstop$geometry)
[1] 5160

Explanations for each field in the data can be found in the following metadata.

  • BUS_STOP_N: The unique identifier for each bus stop.
  • BUS_ROOF_N: The identifier for the bus route or roof associated with the bus stop.
  • LOC_DESC: Location description providing additional information about the bus stop’s surroundings.
  • geometry: The spatial information representing the location of each bus stop as a point in the SVY21 projected coordinate reference system.

To ensure that geospatial data from different sources can be processed together, both have to be projected in the same coordinate system and assigned the correct EPSG code. The st_crs function is used to check the EPSG code and coordinate system of both geospatial files. For Singapore, the coordinate system is SVY21 with EPSG code 3414 (source: epsg.io). The following code chunk output shows that the CRS was initially not 3414, then corrected using st_set_crs. Simultaneously, the dataframe is also prepared for the joining process.

Code
# Check the current Coordinate Reference System (CRS) of the busstop dataset
print("Current CRS of busstop dataset:")
[1] "Current CRS of busstop dataset:"
Code
print(st_crs(busstop))
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
Code
# Assign a new EPSG code (Spatial Reference ID)
busstop <- st_set_crs(
   busstop, 
   3414
) %>%
# Rename the bus stop origin column for easier joining with the main dataframe
mutate(
   ORIGIN_PT_CODE = as.factor(BUS_STOP_N)
) %>%
select(
   ORIGIN_PT_CODE, 
   LOC_DESC,
   geometry
) %>%
# Change all column names to lowercase for consistency
rename_with(
   tolower, everything()
)

# Confirm the updated EPSG code after the assignment
print("Updated CRS of busstop dataset:")
[1] "Updated CRS of busstop dataset:"
Code
print(st_crs(busstop))
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

2.3 Data Wrangling

check for duplicates

In this step, we inspect the dataset for duplicate entries. This precaution is crucial to avoid the inadvertent repetition of records, which could lead to overcounting passenger trips. Identifying and addressing duplicates at this stage ensures the accuracy and reliability of the subsequent analysis.

The following code checks for duplicates and shows how many exist in each odb dataframe.

Code
# Count the number of rows in each dataframe with duplicates
count_duplicate_rows <- function(df, df_name) {
  df %>%
    group_by_all() %>%
    filter(n() > 1) %>%
    ungroup() %>%
    summarise(df_name = df_name, row_count = n())
}

# Apply the function to each dataframe
duplicate1 <- count_duplicate_rows(odb8, "odb8")
duplicate2 <- count_duplicate_rows(odb9, "odb9")
duplicate3 <- count_duplicate_rows(odb10, "odb10")

# Combine the results
all_duplicates <- bind_rows(duplicate1, duplicate2, duplicate3)

# Print the counts
print(all_duplicates)
# A tibble: 3 × 2
  df_name row_count
  <chr>       <int>
1 odb8            0
2 odb9            0
3 odb10           0

The result shows that there is no duplicated data in the origin-destination bus dataset. Note that the check used a strict definition of duplicate: rows are flagged only when they have the same value across all columns.
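To make the "all columns must match" criterion concrete, here is a toy illustration in Python (the records and the helper exact_duplicates are made up for this sketch):

```python
# Toy trip records: (origin, destination, hour, trips)
rows = [
    ("01012", "01013", 7, 10),
    ("01012", "01013", 7, 10),   # exact duplicate: every column matches
    ("01012", "01013", 8, 12),   # same stops, different hour: not a duplicate
]

def exact_duplicates(records):
    """Return rows flagged only when every column matches an earlier row."""
    seen, dupes = set(), []
    for r in records:
        if r in seen:
            dupes.append(r)
        seen.add(r)
    return dupes
```

A looser criterion (say, matching on origin and destination only) would flag the third row as well, so the choice of criterion matters for how aggressively records are removed.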

The following code checks the geospatial bus stop dataset for duplicated bus stop codes and displays the affected rows.

Code
# Subset rows where origin_pt_code has duplicates
duplicates <- busstop[duplicated(busstop$origin_pt_code) | duplicated(busstop$origin_pt_code, fromLast = TRUE), ]

# Display the rows with duplicate origin_pt_code
kable(head(duplicates, n=32))
origin_pt_code loc_desc geometry
149 58031 OPP CANBERRA DR POINT (27089.69 47570.9)
166 62251 Bef Blk 471B POINT (35500.54 39943.41)
278 47201 NA POINT (22616.75 47793.68)
338 58031 OPP CANBERRA DR POINT (27111.07 47517.77)
501 22501 Blk 662A POINT (13489.09 35536.4)
751 82221 BLK 3A POINT (35323.6 33257.05)
1321 68091 AFT BAKER ST POINT (32164.11 42695.98)
1609 43709 BLK 644 POINT (18963.42 36762.8)
1937 97079 OPP ST. JOHN’S CRES POINT (44144.57 38980.25)
2035 82221 Blk 3A POINT (35308.74 33335.17)
2038 97079 OPP ST. JOHN’S CRES POINT (44055.75 38908.5)
2092 22501 BLK 662A POINT (13488.02 35537.88)
2237 62251 BEF BLK 471B POINT (35500.36 39943.34)
2656 68099 BEF BAKER ST POINT (32154.9 42742.82)
2773 52059 OPP BLK 65 POINT (30770.3 34460.06)
2970 67421 CHENG LIM STN EXIT B POINT (34548.54 42052.15)
3126 11009 Ghim Moh Ter POINT (23101.34 32594.17)
3156 53041 Upp Thomson Road POINT (28105.8 37246.76)
3158 53041 Upp Thomson Road POINT (27956.34 37379.29)
3255 77329 BEF PASIR RIS ST 53 POINT (40765.35 39452.18)
3261 77329 Pasir Ris Central POINT (40728.15 39438.15)
3264 96319 Yusen Logistics POINT (42187.23 34995.78)
3265 96319 YUSEN LOGISTICS POINT (42187.23 34995.78)
3303 52059 BLK 219 POINT (30565.45 36133.15)
3411 43709 BLK 644 POINT (18952.02 36751.83)
3470 51071 MACRITCHIE RESERVOIR POINT (28311.27 36036.92)
3472 51071 MACRITCHIE RESERVOIR POINT (28282.54 36033.93)
3665 11009 GHIM MOH TER POINT (23100.58 32604.36)
3752 68091 AFT BAKER ST POINT (32038.84 43298.68)
3753 68099 BEF BAKER ST POINT (32004.05 43320.34)
4624 47201 W’LANDS NTH STN POINT (22632.92 47934)
5095 67421 CHENG LIM STN EXIT B POINT (34741.77 42004.21)

The result shows that there is some duplicated data in the geospatial bus stop dataset. This might interfere with the data joining process and generate double counting later on. Note that the check was based on the origin_pt_code field only, because it will be the reference for joining the datasets later on. At a glance, the table above also shows that the duplicated codes are located near each other. Therefore, removing the duplicates can be considered safe, as the remaining bus stop code can represent the deleted one. The following code chunk removes the duplicates and shows that the number of rows has been reduced.

Code
# Keep only the first occurrence of each duplicated bus stop code
busstop <- busstop[!duplicated(busstop$origin_pt_code), ]

# Display the resulting dataset
glimpse(busstop)
Rows: 5,145
Columns: 3
$ origin_pt_code <fct> 22069, 32071, 44331, 96081, 11561, 66191, 23389, 54411,…
$ loc_desc       <fct> OPP CEVA LOGISTICS, AFT TRACK 13, BLK 239, GRACE INDEPE…
$ geometry       <POINT [m]> POINT (13576.31 32883.65), POINT (13228.59 44206.…

categorical peak time variable

In the interest of analysing peak times, the following code assigns a new column that categorises the peak period, filters out data outside peak times, and shows a sample of the output.

Code
# Function to categorize and aggregate trips
categorize_and_aggregate <- function(odb) {
  odb <- odb %>%
    # Categorize trips under periods based on day and timeframe
    mutate(period = case_when(
      DAY_TYPE == "WEEKDAY" & TIME_PER_HOUR >= 6 & TIME_PER_HOUR <= 9 ~ "Weekday morning peak",
      DAY_TYPE == "WEEKDAY" & TIME_PER_HOUR >= 17 & TIME_PER_HOUR <= 20 ~ "Weekday evening peak",
      DAY_TYPE == "WEEKENDS/HOLIDAY" & TIME_PER_HOUR >= 11 & TIME_PER_HOUR <= 14 ~ "Weekend/holiday morning peak",
      DAY_TYPE == "WEEKENDS/HOLIDAY" & TIME_PER_HOUR >= 16 & TIME_PER_HOUR <= 19 ~ "Weekend/holiday evening peak",
      TRUE ~ "non-peak"
    )) %>%
    # Only retain needed periods for analysis
    filter(period != "non-peak") %>%
    # Compute the number of trips per origin bus stop per month for each period
    group_by(YEAR_MONTH, period, ORIGIN_PT_CODE) %>%
    summarise(num_trips = sum(TOTAL_TRIPS)) %>%
    # Change all column names to lowercase
    rename_with(tolower, everything()) %>%
    ungroup()

  return(odb)
}

# Apply the function to each dataset
odb8 <- categorize_and_aggregate(odb8)
odb9 <- categorize_and_aggregate(odb9)
odb10 <- categorize_and_aggregate(odb10)

# sample the result
glimpse(odb10)
Rows: 20,072
Columns: 4
$ year_month     <chr> "2023-10", "2023-10", "2023-10", "2023-10", "2023-10", …
$ period         <chr> "Weekday evening peak", "Weekday evening peak", "Weekda…
$ origin_pt_code <fct> 01012, 01013, 01019, 01029, 01039, 01059, 01109, 01112,…
$ num_trips      <dbl> 8000, 7038, 3398, 9089, 12095, 2212, 276, 43513, 25769,…

choosing the month

To decide which month is better for performing LISA, or whether analysing all months separately would yield interesting results, it is a good step to check the data distribution of each month. By comparing the distributions, we can make an educated guess as to whether the LISA results for each month would be starkly different or broadly similar.

Code
# Combine odb8, odb9, and odb10 into one dataframe
combined_data <- bind_rows(
  mutate(odb8, month = "odb8"),
  mutate(odb9, month = "odb9"),
  mutate(odb10, month = "odb10")
)

# Create a standard vertical box plot
bus_boxplot <- combined_data %>%
  ggplot(aes(x = period, y = num_trips, fill = period)) + # Assigning aesthetics for x and y axes, and fill color based on period
  geom_boxplot() + # Adding the box plot layer
  facet_wrap(~month, scales = 'free_x') + # Faceting by month, with free scales for x axis
  labs(
    title = "Distribution of Trips across Peak Periods",
    subtitle = "Comparison across different months",
    x = "Period",
    y = "Number of Trips"
  ) +
  theme_minimal() + # Using a minimal theme for a cleaner look
  theme(axis.text.x = element_blank(), axis.title.x = element_blank()) # Removing x-axis category labels and label  
bus_boxplot

The box plot of the raw data distribution was not too helpful, as it shows that every peak period in every month has extreme outliers. Therefore, the next code chunk modifies the box plot by log-transforming the number of trips.

Code
# Create a modified vertical box plot
bus_boxplot <- combined_data %>%
  ggplot(aes(x = period, y = log(num_trips), fill = period)) + # Modified aesthetics with log-transformed y-axis
  geom_boxplot() + # Adding the box plot layer
  facet_wrap(~month, scales = 'free_x') + # Faceting by month, with free scales for x axis
  labs(
    title = "Distribution of Log-Transformed Trips",
    subtitle = "Comparison across different months",
    y = "Log(Number of Trips)"
  ) +
  theme_minimal() + # Using a minimal theme for a cleaner look
  theme(axis.text.x = element_blank(), axis.title.x = element_blank()) # Removing x-axis category labels and label

bus_boxplot

The log-transformed box plot shows that the distribution of each peak period is quite similar across months. In general, the number of trips during the weekday peaks is typically higher than during the weekend/holiday peaks. The similarity also extends to the extreme outliers; for example, the green box plot (Weekday morning peak) always has a single extreme outlier at its top. Based on this observation, it can be inferred that performing LISA on one month is most likely enough to discover the pattern. October, the latest month available, is chosen for the subsequent processes.

extract trip numbers into wide form of peak period categories

Code
# Extract data from odb10 and store as a separate dataframe
odb10_wide <- odb10 %>%
  # Pivot the data wider, treating each row as a bus stop code with peak period trips as columns
  pivot_wider(
    names_from = period,
    values_from = num_trips
  ) %>%
  select(2:6)

DT::datatable(odb10_wide,
              options = list(pageLength = 8),
              rownames = FALSE)
Code
# check the output
glimpse(odb10_wide)
Rows: 5,072
Columns: 5
$ origin_pt_code                 <fct> 01012, 01013, 01019, 01029, 01039, 0105…
$ `Weekday evening peak`         <dbl> 8000, 7038, 3398, 9089, 12095, 2212, 27…
$ `Weekday morning peak`         <dbl> 1770, 841, 1530, 2483, 2919, 1734, 200,…
$ `Weekend/holiday evening peak` <dbl> 3061, 2770, 1685, 4063, 7263, 1118, 101…
$ `Weekend/holiday morning peak` <dbl> 2177, 1818, 1536, 3217, 5408, 1159, 116…

join aspatial and geospatial data

To retain the geospatial properties, use left_join with busstop as the main dataframe and the bus stop code as the join key.

Code
odb10_sf <- left_join(busstop, odb10_wide, by = "origin_pt_code")

glimpse(odb10_sf)
Rows: 5,145
Columns: 7
$ origin_pt_code                 <fct> 22069, 32071, 44331, 96081, 11561, 6619…
$ loc_desc                       <fct> OPP CEVA LOGISTICS, AFT TRACK 13, BLK 2…
$ `Weekday evening peak`         <dbl> 67, 5, 1740, 445, 199, 349, 432, 1285, …
$ `Weekday morning peak`         <dbl> 50, NA, 2075, 411, 207, 405, 60, 3059, …
$ `Weekend/holiday evening peak` <dbl> 10, NA, 589, 47, 77, 169, 82, 492, 2016…
$ `Weekend/holiday morning peak` <dbl> 8, NA, 682, 110, 70, 177, 16, 1442, 204…
$ geometry                       <POINT [m]> POINT (13576.31 32883.65), POINT …

Creating Hexagon Spatial Grid

odb10_sf represents a spatial point dataframe, which might not be optimal for spatial autocorrelation analysis where the definition of ‘areas’ requires the representation of spatial entities as polygons instead of individual points. To address this, the subsequent code sections transform these bus stops into a grid of hexagons.

  • Utilize the st_make_grid() function to create a hexagon grid.
  • The cellsize parameter, measured in the same units as the dataframe’s coordinate reference system, determines the width of each hexagon. Given that odb10_sf is projected in EPSG:3414 with meters as the unit, the hexagon width is set to 500m.
  • Each hexagon is assigned a unique identifier, known as hex_id, which serves as the primary key for future reference.

Output: Spatial Polygons with Hexagonal Geometry and Hex_id Identification

Code
odb10_hex <- st_make_grid(
    odb10_sf,
    cellsize = 500,
    square = FALSE
  ) %>%
  st_sf() %>%
  rowid_to_column("hex_id")

# check the output
glimpse(odb10_hex)
Rows: 5,580
Columns: 2
$ hex_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ geometry <POLYGON [m]> POLYGON ((3720.122 26626.44..., POLYGON ((3720.122 27…

This code utilizes the st_make_grid function to create a hexagon grid based on the specified parameters. The resulting hexagon grid is then converted into a spatial dataframe (st_sf()) to maintain its geospatial properties. The rowid_to_column function is employed to assign a unique identifier (hex_id) to each hexagon, facilitating subsequent analyses or referencing.

Given that each hexagonal area may encompass multiple bus stops, the hex_id serves as the primary key for aggregation. The subsequent procedures outline how to organize attributes based on hex_id:

  • Utilize the st_join() function with join = st_within to associate bus stop points with hexagon areas.
  • The st_set_geometry(NULL) step drops the geometry column, focusing on attribute aggregation.
  • Employ group_by() so that each unique hex_id forms one group (and hence one row after summarising).
  • Utilize summarise() to calculate the aggregate count of bus stops and trips per peak period within each hexagon area.
  • Replace NA values using replace_na() inside across(): 0 for numeric columns and empty strings for character columns.

Output: Attribute Dataframe, with Hex_id as the Primary Key

Code
odb10_stops <- st_join(
  odb10_sf, 
  odb10_hex, 
  join = st_within
) %>%
  st_set_geometry(NULL) %>%
  group_by(hex_id) %>%
  summarise(
    n_busstops = n(),
    busstops_code = str_c(origin_pt_code, collapse = ","),
    loc_desc = str_c(loc_desc, collapse = ","),
    `Weekday morning peak` = sum(`Weekday morning peak`, na.rm = TRUE),
    `Weekday evening peak` = sum(`Weekday evening peak`, na.rm = TRUE),
    `Weekend/holiday morning peak` = sum(`Weekend/holiday morning peak`, na.rm = TRUE),
    `Weekend/holiday evening peak` = sum(`Weekend/holiday evening peak`, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(across(where(is.numeric), ~ replace_na(., 0)),
         across(where(is.character), ~ replace_na(., "")))

# check the output
glimpse(odb10_stops)
Rows: 1,524
Columns: 8
$ hex_id                         <int> 34, 65, 99, 127, 129, 130, 131, 159, 16…
$ n_busstops                     <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, …
$ busstops_code                  <chr> "25059", "25751", "26379", "25761", "25…
$ loc_desc                       <chr> "AFT TUAS STH BLVD", "BEF TUAS STH AVE …
$ `Weekday morning peak`         <dbl> 103, 52, 78, 185, 1116, 53, 60, 64, 251…
$ `Weekday evening peak`         <dbl> 390, 114, 291, 1905, 2899, 241, 368, 29…
$ `Weekend/holiday morning peak` <dbl> 0, 26, 52, 187, 455, 76, 45, 21, 39, 69…
$ `Weekend/holiday evening peak` <dbl> 56, 14, 100, 346, 634, 55, 49, 53, 132,…
  • Use left_join to combine the new odb10_stops attribute dataframe with the existing odb10_hex hexagon geospatial layer, linking them by hex_id to integrate attributes into the spatial polygon dataframe.
  • Employ filter to exclude hexagons without bus stops.

Output: Spatial Polygon Dataframe

Code
odb10_complete <- odb10_hex %>%
  left_join(odb10_stops,
            by = "hex_id"
  ) %>%
  replace(is.na(.), 0)

odb10_final <- filter(odb10_complete,
                       n_busstops > 0)

# check the output
glimpse(odb10_final)
Rows: 1,524
Columns: 9
$ hex_id                         <int> 34, 65, 99, 127, 129, 130, 131, 159, 16…
$ n_busstops                     <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, …
$ busstops_code                  <chr> "25059", "25751", "26379", "25761", "25…
$ loc_desc                       <chr> "AFT TUAS STH BLVD", "BEF TUAS STH AVE …
$ `Weekday morning peak`         <dbl> 103, 52, 78, 185, 1116, 53, 60, 64, 251…
$ `Weekday evening peak`         <dbl> 390, 114, 291, 1905, 2899, 241, 368, 29…
$ `Weekend/holiday morning peak` <dbl> 0, 26, 52, 187, 455, 76, 45, 21, 39, 69…
$ `Weekend/holiday evening peak` <dbl> 56, 14, 100, 346, 634, 55, 49, 53, 132,…
$ geometry                       <POLYGON [m]> POLYGON ((3970.122 27925.48...,…

The code above first associates bus stop attributes with hexagonal areas using st_join, then groups and summarises the attributes by the unique hex_id. The resulting dataframe, odb10_stops, is an attribute-centric representation with hex_id as the primary key.

Next, this attribute dataframe is combined with the original hexagon geospatial layer, odb10_hex, using left_join. The resulting spatial polygon dataframe, odb10_final, integrates both geometric and attribute information, and the filter operation ensures that only hexagons containing bus stops are retained.

3 Making Sense of the Data (GeoVisualization and Analysis)


3.1 Distribution of Bus Stops

Code
# Plot the hexagon grid based on n_busstops
tmap_mode("view")
busstop_map = tm_shape(odb10_final)+
  tm_fill(
    col = "n_busstops",
    palette = "Blues",
    style = "cont",
    title = "Number of Bus Stops",
    id = "loc_desc",
    showNA = FALSE,
    alpha = 0.6,
    popup.format = list(
      grid_id = list(format = "f", digits = 0)
    )
  )+
  tm_borders(col = "grey40", lwd = 0.7)
busstop_map

From the map, we can infer that the central region, which likely encompasses the Central Business District (CBD) and the surrounding residential areas known to be highly populated and active, has a higher number of bus stops. This is consistent with the need for robust public transportation in densely populated areas to support the daily commute of residents and workers. Lighter shades toward the periphery of the island suggest fewer bus stops, corresponding to less urbanized or industrial areas, such as the Western and North-Eastern regions, where population density is lower. The map reflects Singapore’s urban planning and transportation strategy, which aims to provide accessibility and connectivity, particularly in high-density regions where demand for public transit is greatest.

3.2 Distribution of Trips Across Peak Hours

First, find suitable breaks for plotting the trips using summary().

Code
summary(odb10_final)
     hex_id       n_busstops     busstops_code        loc_desc        
 Min.   :  34   Min.   : 1.000   Length:1524        Length:1524       
 1st Qu.:1941   1st Qu.: 2.000   Class :character   Class :character  
 Median :2950   Median : 3.000   Mode  :character   Mode  :character  
 Mean   :2813   Mean   : 3.376                                        
 3rd Qu.:3734   3rd Qu.: 4.000                                        
 Max.   :5558   Max.   :11.000                                        
 Weekday morning peak Weekday evening peak Weekend/holiday morning peak
 Min.   :     0       Min.   :     0       Min.   :     0.0            
 1st Qu.:   909       1st Qu.:  2139       1st Qu.:   384.8            
 Median :  7532       Median :  7246       Median :  2159.5            
 Mean   : 16838       Mean   : 16135       Mean   :  4986.7            
 3rd Qu.: 23245       3rd Qu.: 16947       3rd Qu.:  6371.0            
 Max.   :386450       Max.   :542529       Max.   :109329.0            
 Weekend/holiday evening peak          geometry   
 Min.   :     0.0             POLYGON      :1524  
 1st Qu.:   529.5             epsg:3414    :   0  
 Median :  2157.5             +proj=tmer...:   0  
 Mean   :  4999.7                                 
 3rd Qu.:  5458.2                                 
 Max.   :150399.0                                 

The summary confirms that the trips data is right-skewed and contains extreme outliers. This knowledge is used to customize the breaks in the comparison maps.

The following code plots the comparison maps.

Code
# Define the columns for which we want to find the global min and max
value_columns <- c("Weekday morning peak", "Weekday evening peak", "Weekend/holiday morning peak", "Weekend/holiday evening peak")

# Set tmap to interactive viewing mode
tmap_mode("view")

# Define a function to create each map with a customized legend
create_map <- function(value_column) {
  tm_shape(odb10_final) +
    tm_fill(
      col = value_column,
      palette = "-RdBu",
      style = "fixed",
      title = value_column,
      id = "loc_desc",
      showNA = FALSE,
      alpha = 0.6,
      breaks = c(0, 500, 1000, 2000, 3000, 5000, 10000, 50000, 100000, 300000, 550000)
    ) +
    tm_borders(col = "grey40", lwd = 0.7) +
    tm_layout(
      legend.position = c("right", "bottom"), # Adjust legend position
      legend.frame = TRUE,
      legend.width = 0.15, # Adjust the width of the legend
      legend.height = 0.5  # Adjust the height of the legend
    )
}

# Apply the function to each value column and store the maps
map_list <- lapply(value_columns, create_map)

# Combine the maps into a single column for easier comparison
combined_map <- tmap_arrange(map_list[[1]], map_list[[2]], map_list[[3]], map_list[[4]], ncol = 1, nrow = 4)

# Show the combined map
combined_map

Using the same classification scale for all maps clearly shows that weekday trip volumes are generally higher than weekend/holiday volumes. Nevertheless, at a glance, the crowded areas are broadly the same across all days and times. This confirms the earlier finding that the areas with the most traffic likely encompass the Central Business District (CBD) and the surrounding residential areas, which are among the most populated and active parts of Singapore. It also reflects Singapore’s urban planning and transportation strategy, in which the busiest areas with potentially high traffic are supported by more bus stops.

3.3 Distribution of Average Trips per Bus Stop

The next distribution examines which areas are busiest on average, in terms of trips per bus stop. First, the following code generates new trips-per-bus-stop columns for each hexagon.

Code
# Create a new dataframe with transformed columns
odb10_final <- odb10_final %>%
  mutate(across(all_of(value_columns), ~ .x / n_busstops, .names = "avg_{.col}"))
# check the summary for determining break points
summary(odb10_final)
     hex_id       n_busstops     busstops_code        loc_desc        
 Min.   :  34   Min.   : 1.000   Length:1524        Length:1524       
 1st Qu.:1941   1st Qu.: 2.000   Class :character   Class :character  
 Median :2950   Median : 3.000   Mode  :character   Mode  :character  
 Mean   :2813   Mean   : 3.376                                        
 3rd Qu.:3734   3rd Qu.: 4.000                                        
 Max.   :5558   Max.   :11.000                                        
 Weekday morning peak Weekday evening peak Weekend/holiday morning peak
 Min.   :     0       Min.   :     0       Min.   :     0.0            
 1st Qu.:   909       1st Qu.:  2139       1st Qu.:   384.8            
 Median :  7532       Median :  7246       Median :  2159.5            
 Mean   : 16838       Mean   : 16135       Mean   :  4986.7            
 3rd Qu.: 23245       3rd Qu.: 16947       3rd Qu.:  6371.0            
 Max.   :386450       Max.   :542529       Max.   :109329.0            
 Weekend/holiday evening peak          geometry    avg_Weekday morning peak
 Min.   :     0.0             POLYGON      :1524   Min.   :     0          
 1st Qu.:   529.5             epsg:3414    :   0   1st Qu.:   391          
 Median :  2157.5             +proj=tmer...:   0   Median :  2598          
 Mean   :  4999.7                                  Mean   :  4513          
 3rd Qu.:  5458.2                                  3rd Qu.:  6119          
 Max.   :150399.0                                  Max.   :119816          
 avg_Weekday evening peak avg_Weekend/holiday morning peak
 Min.   :     0.0         Min.   :    0.0                 
 1st Qu.:   950.8         1st Qu.:  162.8                 
 Median :  2284.0         Median :  775.0                 
 Mean   :  4420.4         Mean   : 1349.9                 
 3rd Qu.:  4575.0         3rd Qu.: 1659.1                 
 Max.   :136001.0         Max.   :43420.0                 
 avg_Weekend/holiday evening peak
 Min.   :    0.0                 
 1st Qu.:  224.1                 
 Median :  711.1                 
 Mean   : 1380.3                 
 3rd Qu.: 1468.9                 
 Max.   :39425.0                 

The following code plots the comparison maps.

Code
# Define the columns for which we want to find the global min and max
value_columns <- c("avg_Weekday morning peak", "avg_Weekday evening peak", "avg_Weekend/holiday morning peak", "avg_Weekend/holiday evening peak")

# Set tmap to interactive viewing mode
tmap_mode("view")

# Define a function to create each map with a customized legend
create_map <- function(value_column) {
  tm_shape(odb10_final) +
    tm_fill(
      col = value_column,
      palette = "-RdBu",
      style = "fixed",
      title = value_column,
      id = "loc_desc",
      showNA = FALSE,
      alpha = 0.6,
      breaks = c(0, 100, 200, 300, 400, 500, 750, 1000, 1500, 5000, 10000, 50000, 140000)
    ) +
    tm_borders(col = "grey40", lwd = 0.7) +
    tm_layout(
      legend.position = c("right", "bottom"), # Adjust legend position
      legend.frame = TRUE,
      legend.width = 0.15, # Adjust the width of the legend
      legend.height = 0.5  # Adjust the height of the legend
    )
}

# Apply the function to each value column and store the maps
map_list <- lapply(value_columns, create_map)

# Combine the maps into a single column for easier comparison
combined_map <- tmap_arrange(map_list[[1]], map_list[[2]], map_list[[3]], map_list[[4]], ncol = 1, nrow = 4)

# Show the combined map
combined_map

At a glance, the average trips per bus stop reveal a slightly different insight. Compared with total trips, the averages show that the areas around Jurong, Woodlands, and the bus stops in Johor (part of Malaysia) are busier than the totals suggest. In the context of transport policy, this could be a first lead for expanding the number of bus stops in those areas to cater for more commuters.

4 Local Indicators of Spatial Association analysis

In this section, we will employ Local Indicators of Spatial Association (LISA) statistics to explore the presence of clusters or outliers in the total number of trips generated at the hexagon layer. Specifically, we will utilize Moran’s I statistic to detect spatial patterns such as clustering or dispersion of total trips.

4.1 Constructing a Distance-Based Spatial Weights Matrix

Before delving into global/local spatial autocorrelation statistics, we need to create a spatial weights matrix for our study area. This matrix defines the neighbors of hexagonal spatial units based on their distances. Here are some considerations:

  • Each feature should have at least one neighbor, and no feature should be a neighbor with all other features.

  • For skewed data, each feature should ideally have at least 8 neighbors (we use 12 here).

Given the sparse distribution of bus stops in certain regions (e.g., central catchment areas, military training areas, and airports), distance-based methods are favored over contiguity methods.
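As a quick sanity check, queen contiguity can be computed on the filtered hexagon layer to show that some hexagons would end up with no neighbors at all. This is a sketch, assuming the sfdep and spdep packages are loaded alongside the objects created above:

```r
# Sketch (assumption): illustrate why contiguity neighbours are unsuitable here.
# After filtering to hexagons that contain bus stops, spatially isolated hexagons
# have no touching neighbours under queen contiguity.
nb_contig <- st_contiguity(st_geometry(odb10_final))

# spdep::card() counts neighbours per feature; any zeros flag isolated hexagons
table(spdep::card(nb_contig) == 0)
```

If the table reports any TRUE entries, those hexagons violate the first consideration above, which motivates the distance-based (k-nearest-neighbors) approach.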

The Adaptive Distance (fixed number of neighbors) method is chosen over a fixed-distance threshold because of the right-skewed nature of our data: each hexagon is guaranteed at least n neighbors, which facilitates later statistical significance testing.

We will set each hexagon to have 12 neighbors using the provided R code. The process involves using st_knn to obtain a list of neighbors, and then st_weights to generate row-standardized spatial weights.

Code
wm_all <- odb10_final %>% 
  mutate(
    nb = st_knn(geometry, k = 12),
    wt = st_weights(nb, style = 'W'),
    .before = 1
  )

# check the output
kable(head(wm_all))
nb wt hex_id n_busstops busstops_code loc_desc Weekday morning peak Weekday evening peak Weekend/holiday morning peak Weekend/holiday evening peak geometry avg_Weekday morning peak avg_Weekday evening peak avg_Weekend/holiday morning peak avg_Weekend/holiday evening peak
2, 4, 5, 8, 9, 12, 13, 16, 22, 23, 38, 39 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333 34 1 25059 AFT TUAS STH BLVD 103 390 0 56 POLYGON ((3970.122 27925.48… 103 390.0 0.0 56
1, 3, 4, 5, 8, 9, 12, 13, 16, 22, 23, 38 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333 65 1 25751 BEF TUAS STH AVE 14 52 114 26 14 POLYGON ((4220.122 28358.49… 52 114.0 26.0 14
5, 6, 7, 9, 10, 13, 14, 16, 17, 18, 24, 25 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333 99 1 26379 YONG NAM 78 291 52 100 POLYGON ((4470.122 30523.55… 78 291.0 52.0 100
1, 2, 5, 8, 9, 12, 13, 16, 22, 23, 38, 39 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333 127 1 25761 OPP REC S’PORE 185 1905 187 346 POLYGON ((4720.122 28358.49… 185 1905.0 187.0 346
3, 6, 9, 10, 12, 13, 14, 16, 17, 24, 30, 31 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333 129 2 25719,26389 THE INDEX,BEF TUAS STH AVE 5 1116 2899 455 634 POLYGON ((4720.122 30090.54… 558 1449.5 227.5 317
3, 5, 7, 9, 10, 11, 13, 14, 17, 18, 24, 25 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333 130 1 26369 SEE HUP SENG 53 241 76 55 POLYGON ((4720.122 30956.57… 53 241.0 76.0 55

4.2 Global Autocorrelation of Spatial Association (Global Moran’s I with Simulation)

Global spatial association evaluates the overall spatial pattern of a variable, in this case, the total number of trips across the entire study area. It provides a single metric summarizing the extent to which similar values cluster together or disperse across the geographic space.

Compute Global Moran’s I for the trips generated at the hexagon level in each peak period, using a Monte-Carlo permutation test to avoid assumptions of normality and randomness. With nsim = 99, the number of simulations plus the observed statistic equals 100.

Code
# Set the seed to ensure reproducibility
set.seed(1234)

# define the value_columns
value_columns <- c("Weekday morning peak", "Weekday evening peak", "Weekend/holiday morning peak", "Weekend/holiday evening peak")

# Create a function to perform global Moran's I test
perform_global_moran <- function(data, value_column, k) {
  # Compute spatial weights
  nb <- st_knn(data$geometry, k = k)
  wt <- st_weights(nb, style = 'W')

  # Perform global Moran's I test
  moran_result <- global_moran_perm(data[[value_column]], nb, wt, nsim = 99)
  
  # Include the value_column in the result
  result <- list(
    value_column = value_column,
    moran_result = moran_result
  )
  
  return(result)
}

# Apply the function for each time interval
results <- lapply(value_columns, function(vc) perform_global_moran(wm_all, vc, k = 12))

# Print the results
print(results)
[[1]]
[[1]]$value_column
[1] "Weekday morning peak"

[[1]]$moran_result

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.18614, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided



[[2]]
[[2]]$value_column
[1] "Weekday evening peak"

[[2]]$moran_result

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.044306, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided



[[3]]
[[3]]$value_column
[1] "Weekend/holiday morning peak"

[[3]]$moran_result

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.13804, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided



[[4]]
[[4]]$value_column
[1] "Weekend/holiday evening peak"

[[4]]$moran_result

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.084188, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided

For all four time intervals, since the p-value for global Moran’s I is smaller than 0.05, we can reject the null hypothesis that the spatial patterns are random. Moreover, as the Moran’s I statistics are greater than 0, we can infer that the spatial distribution exhibits signs of clustering for all four time intervals, consistent with the choropleth maps plotted earlier.
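For reporting, the verbose list output above can be condensed into a single summary table. This is a sketch, assuming the results list from the chunk above and that each moran_result is an htest-style object exposing statistic and p.value fields:

```r
# Sketch: condense the Monte-Carlo Moran's I results into one table
# (assumes `results` from the chunk above; the `statistic` and `p.value` fields
# follow the htest-style object returned by global_moran_perm())
moran_summary <- purrr::map_dfr(results, function(res) {
  tibble::tibble(
    peak_period = res$value_column,
    morans_I    = unname(res$moran_result$statistic),
    p_value     = res$moran_result$p.value
  )
})

kable(moran_summary)
```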

4.3 Local Autocorrelation of Spatial Association

Local spatial association provides a more detailed examination of spatial patterns at the local level, identifying specific areas with strong or weak spatial association. Local Moran’s I categorizes regions as high-high, low-low, high-low, or low-high, indicating clustering or outlier behavior.

Compute Local Indicators of Spatial Association (LISA) for passenger trips generated by origin at the hexagon level. The local_moran() function from the sfdep package is used with 99 simulations; the neighbor list and weights are recomputed inside the helper function. The results are then unnested for further analysis and displayed as interactive datatables.

Code
# Create a function to perform local Moran's I analysis
get_lisa <- function(wm, value_column, k) {
  # Compute spatial weights
  nb <- st_knn(wm$geometry, k = k)
  wt <- st_weights(nb, style = 'W')

  # Perform local Moran's I analysis and create a new data frame
  result <- wm %>% 
    mutate(
      local_moran = local_moran(.data[[value_column]], nb, wt, nsim = 99),
      .before = 1
    ) %>%
    unnest(local_moran)
  
  return(result)
}

# Initialize a list to store results for each value_column
lisa_results <- list()

# Apply the function for each time interval and store results in the list
for (vc in value_columns) {
  lisa_results[[vc]] <- get_lisa(wm_all, vc, k = 12)
}

# Create a function to display a data table
display_datatable <- function(lisa_result, value_column) {
  datatable(
    lisa_result,
    class = 'cell-border stripe',
    options = list(pageLength = 5),
    caption = paste("Local Moran's I for", value_column)
  )
}

# Display data tables for each value column
# (htmlwidgets created inside a for loop are not auto-printed; collect them
# with htmltools::tagList() so they render)
htmltools::tagList(
  lapply(value_columns, function(vc) display_datatable(lisa_results[[vc]], vc))
)

4.4 Visualizing Significant Local Moran’s I at 95% Confidence Level

Utilize tmap core functions to construct choropleth maps displaying the Local Moran’s I field and p-value field for all four time intervals. Only significant values of Local Moran’s I at the 95% confidence level are plotted.
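A minimal sketch of such a map, assuming the lisa_results list from the previous section (the column names ii and p_ii_sim follow the output of sfdep::local_moran(); verify them against your own results):

```r
# Sketch: plot only hexagons whose local Moran's I is significant at the 95% level
# (assumes `lisa_results` and `value_columns` from the previous chunks; `ii` is
# the local statistic and `p_ii_sim` the simulated p-value in sfdep::local_moran())
tmap_mode("plot")

plot_sig_lisa <- function(lisa_result, value_column) {
  lisa_sig <- lisa_result %>%
    filter(p_ii_sim < 0.05)   # keep significant hexagons only

  tm_shape(lisa_result) +
    tm_polygons() +           # all hexagons as a neutral background
  tm_shape(lisa_sig) +
    tm_fill("ii", midpoint = 0, title = "Local Moran's I") +
    tm_borders(col = "grey40", lwd = 0.7) +
    tm_layout(main.title = paste(value_column, "(p < 0.05)"))
}

sig_maps <- lapply(value_columns, function(vc) plot_sig_lisa(lisa_results[[vc]], vc))
tmap_arrange(sig_maps[[1]], sig_maps[[2]], sig_maps[[3]], sig_maps[[4]], ncol = 2)
```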

5 References

Andrey Sekste and Eduard Kazakov. “H3 hexagonal grid: Why we use it for data analysis and visualization”.

Anselin, L. (1995). “Local indicators of spatial association - LISA”. Geographical Analysis, 27(2): 93-115.

Sarah Gates (2017). “Emerging Hot Spot Analysis: Finding Patterns over Space and Time”

Sid Dhuri (2020). “Spatial Data Analysis With Hexagonal Grids”

Tin Seong Kam. “2 Choropleth Mapping with R” From R for Geospatial Data Science and Analytics

Tin Seong Kam. “9 Global Measures of Spatial Autocorrelation” From R for Geospatial Data Science and Analytics

Tin Seong Kam. “10 Local Measures of Spatial Autocorrelation” From R for Geospatial Data Science and Analytics

Land Transport Authority. Land Transport Data Mall

Wong, Kenneth. “Create spatial square/hexagon grids and count points inside in R with sf”. From Urban Data Palette.